mkanoor/agents

Red Hat Auto-Remediation Workflow Generator

Automated diagnostic and remediation workflow generation for Red Hat Enterprise Linux (RHEL) systems using AI and MCP (Model Context Protocol) servers.

Powered by OpenRouter: Access Claude, GPT-4, Gemini, and other leading models through a single API.

Overview

This project provides an end-to-end pipeline that:

  1. Diagnoses system issues from error logs using Red Hat's Security Data API and Knowledge Base
  2. Generates executable remediation workflows with proper error handling and approvals
  3. Produces workflow definitions compatible with the Nexus Workflow Engine

Architecture

Error Logs
    ↓
[Diagnostic Agent]
    ↓ (uses MCP servers)
    ├── Red Hat Security Data API (CVEs, advisories)
    └── Red Hat Knowledge Base (solutions, articles)
    ↓
Diagnosis (root causes + remediation steps)
    ↓
[Workflow Generator]
    ↓
Executable Workflow Definition (JSON)
    ↓
[Workflow Engine] (execution - not included)

See ARCHITECTURE.md for detailed visual diagrams of the entire process flow, MCP server architecture, agentic research loop, and data flow.

Features

  • Agentic Research: LLM autonomously researches using MCP tools
  • Multi-source Intelligence: Combines CVE data + KB articles
  • Structured Output: Generates valid workflow JSON
  • Risk Assessment: Assigns risk levels and approval requirements
  • Retry Policies: Automatic retry configuration based on risk
  • Checkpointing: Saves diagnosis and workflow at each stage

Project Structure

redhat-diagnostic-workflow/
├── mcp_servers/
│   ├── redhat_security_server.py  # MCP server for Security Data API
│   └── redhat_kb_server.py         # MCP server for Knowledge Base API
├── diagnostic_agent/
│   ├── diagnostic_agent.py         # Main diagnostic agent
│   ├── workflow_generator.py       # Workflow generator
│   └── pipeline.py                 # Complete orchestration pipeline
├── examples/
│   ├── nginx_openssl_error.log     # Example: nginx segfault
│   └── systemd_timeout_error.log   # Example: systemd/PostgreSQL issue
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment variable template
├── test_redhat_access.py           # Red Hat API connectivity test
├── run_demo.sh                     # Demo script
├── README.md                       # This file
├── QUICKSTART.md                   # Quick start guide
├── OPENROUTER.md                   # OpenRouter setup and usage
├── TESTING.md                      # Testing guide and troubleshooting
├── ARCHITECTURE.md                 # Visual architecture and process flow
└── CHANGELOG.md                    # Project changelog

Prerequisites

  1. Python 3.10+
  2. OpenRouter API key (required; create one at openrouter.ai)
    • Access to Claude, GPT-4, Gemini, and more
    • No waitlist, pay-as-you-go pricing
    • See OPENROUTER.md for setup and model selection
  3. (Optional) Red Hat Customer Portal credentials for authenticated KB access
  4. Nexus Workflow Schema (if using with Nexus)

Installation

1. Clone or navigate to the project directory

cd ~/scratch/redhat-diagnostic-workflow

2. Create virtual environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Set up environment variables

cp .env.example .env

Edit .env and add your API key:

# OpenRouter API Key (required)
OPENROUTER_API_KEY=sk-or-v1-...

# Optional: Red Hat credentials for KB access
REDHAT_USERNAME=your-username
REDHAT_PASSWORD=your-password

Note: Red Hat credentials are only needed for authenticated KB endpoints. The Security Data API is public.
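
A minimal sketch of how a script might read these settings, assuming the variables are already present in the process environment (for example after exporting the contents of .env). The helper name `load_settings` and the returned keys are hypothetical and not taken from this repository.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read API settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env

    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        # The OpenRouter key is the only required setting.
        raise RuntimeError("OPENROUTER_API_KEY is not set")

    username = env.get("REDHAT_USERNAME") or None
    password = env.get("REDHAT_PASSWORD") or None

    return {
        "openrouter_api_key": api_key,
        "redhat_username": username,
        "redhat_password": password,
        # Authenticated KB endpoints need both credentials.
        "kb_auth_enabled": bool(username and password),
    }
```

Failing early on a missing key keeps the error close to its cause instead of surfacing later as a rejected API request.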

Testing Red Hat API Access

Before running the full pipeline, you can test connectivity to Red Hat APIs without requiring an LLM API key:

# Activate virtual environment
source venv/bin/activate

# Run Red Hat API access tests
python test_redhat_access.py

This will test:

  • Red Hat Security Data API (public, no auth required)
    • CVE lookups
    • Security advisories
    • Package vulnerability searches
  • Red Hat Knowledge Base API (optional auth)
    • KB article search
    • Solution lookups
  • MCP server file checks

Expected output:

========================================
RED HAT API ACCESS TEST
========================================

TEST 1: Red Hat Security Data API (Public)
========================================

Get specific CVE
   Testing CVE lookup for CVE-2024-6387 (OpenSSH vulnerability)
   URL: https://access.redhat.com/labs/securitydataapi/cve/CVE-2024-6387.json
   SUCCESS (200 OK)
      CVE ID: CVE-2024-6387
      Severity: High
      CVSS3 Score: 8.1

...

TOTAL: 5/5 tests passed
All tests passed! Red Hat API access is working.

If tests fail, check:

  • Internet connectivity to access.redhat.com
  • Firewall settings
  • Red Hat credentials (for KB API tests)

See TESTING.md for a detailed testing guide and troubleshooting.
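
For a quick manual check outside the test script, the CVE lookup can be sketched with the standard library alone. The base URL is copied from the expected output above. The response key names (`name`, `threat_severity`, `cvss3.cvss3_base_score`) are assumptions about the JSON shape, and the function names are illustrative rather than those used in test_redhat_access.py.

```python
import json
import urllib.request

# Base URL taken from the expected test output above.
BASE_URL = "https://access.redhat.com/labs/securitydataapi"


def cve_url(cve_id: str) -> str:
    """Build the JSON endpoint URL for a single CVE."""
    return f"{BASE_URL}/cve/{cve_id.strip().upper()}.json"


def summarize_cve(payload: dict) -> dict:
    """Pick out the fields shown in the test output.

    The key names are assumptions about the response shape;
    missing keys yield None instead of raising.
    """
    cvss3 = payload.get("cvss3") or {}
    return {
        "id": payload.get("name"),
        "severity": payload.get("threat_severity"),
        "cvss3_score": cvss3.get("cvss3_base_score"),
    }


def fetch_cve(cve_id: str, timeout: float = 10.0) -> dict:
    """Fetch and summarize one CVE (performs a network request)."""
    with urllib.request.urlopen(cve_url(cve_id), timeout=timeout) as resp:
        return summarize_cve(json.load(resp))
```

Calling `fetch_cve("CVE-2024-6387")` requires network access to access.redhat.com; the URL builder and the summarizer can be exercised offline.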

Usage

Basic Usage

# Activate virtual environment
source venv/bin/activate

# Run pipeline with example error log
python diagnostic_agent/pipeline.py \
  --logs examples/nginx_openssl_error.log \
  --schema /path/to/workflow-definition.schema.json \
  --output-dir ./output

Advanced Usage

# Use custom session ID
python diagnostic_agent/pipeline.py \
  --logs examples/systemd_timeout_error.log \
  --schema /path/to/workflow-definition.schema.json \
  --session-id "incident-2025-12-03-001" \
  --output-dir ./output

# Pass error message directly (not from file)
python diagnostic_agent/pipeline.py \
  --logs "nginx segfault in libssl.so" \
  --schema /path/to/workflow-definition.schema.json
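
Since `--logs` accepts either a file path or a literal message, the argument has to be resolved one way or the other. One plausible approach, shown here as a sketch and not as the pipeline's actual logic, is to treat the value as a path when a file with that name exists. `resolve_logs` is a hypothetical name.

```python
from pathlib import Path


def resolve_logs(value: str) -> str:
    """Return file contents if `value` names an existing file, else `value`."""
    try:
        path = Path(value)
        if path.is_file():
            return path.read_text(encoding="utf-8", errors="replace")
    except (OSError, ValueError):
        # Very long or otherwise invalid "paths" are treated as literal text.
        pass
    return value
```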

Output Structure

After running, the pipeline creates:

output/
└── incident-20251203-142345/
    ├── error_logs.txt         # Original error logs
    ├── diagnosis.json         # Diagnostic results
    ├── workflow.json          # Generated workflow
    └── summary.json           # Complete session summary
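
The layout above can be reproduced with a few lines of standard-library Python. This is a sketch under assumptions: the function name `write_session` and the contents of summary.json are illustrative, and only the file names and the `incident-YYYYMMDD-HHMMSS` directory pattern come from the listing.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_session(output_dir: str, error_logs: str, diagnosis: dict,
                  workflow: dict, session_id: Optional[str] = None) -> Path:
    """Write the four session files and return the session directory."""
    if session_id is None:
        # Mirrors the "incident-YYYYMMDD-HHMMSS" pattern shown above.
        session_id = datetime.now().strftime("incident-%Y%m%d-%H%M%S")

    session_dir = Path(output_dir) / session_id
    session_dir.mkdir(parents=True, exist_ok=True)

    (session_dir / "error_logs.txt").write_text(error_logs, encoding="utf-8")
    # Each stage gets its own file so a partial run still leaves artifacts.
    _dump(session_dir / "diagnosis.json", diagnosis)
    _dump(session_dir / "workflow.json", workflow)
    # The summary layout here is an assumption of this sketch.
    _dump(session_dir / "summary.json", {
        "session_id": session_id,
        "diagnosis": diagnosis,
        "workflow": workflow,
    })
    return session_dir


def _dump(path: Path, data: dict) -> None:
    """Serialize `data` as indented JSON."""
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
```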

MCP Servers

Red Hat Security Server

Provides access to Red Hat Security Data API:

Tools:

  • search_cve: Search CVEs by ID or package name
  • get_rhsa: Get security advisory details
  • search_affected_packages: Find affected packages for a CVE
  • get_errata: Get errata information

Example standalone usage:

python mcp_servers/redhat_security_server.py

Red Hat Knowledge Base Server

Provides access to Red Hat Customer Portal KB:

Tools:

  • search_kb: Search KB articles
  • get_kb_article: Get full article by ID
  • search_solutions: Search for error message solutions
  • search_by_symptom: Search by symptom description

Example standalone usage:

export REDHAT_USERNAME=your-username
export REDHAT_PASSWORD=your-password
python mcp_servers/redhat_kb_server.py

Example Workflows

See examples/README.md for complete documentation of all 9 example scenarios.

Example 1: Nginx OpenSSL Vulnerability

Input (examples/nginx_openssl_error.log):

ERROR nginx: worker process exited on signal 11 (core dumped)
ERROR kernel: nginx[1234]: segfault in libssl.so.1.1

Diagnosis:

  • Root cause: Vulnerable OpenSSL 1.1.1k (CVE-XXXX-YYYY)
  • Severity: High
  • Evidence: Segfault in libssl + CVE match

Generated Workflow:

  1. Backup nginx configuration (script, low risk)
  2. Stop nginx service (script, high risk, requires approval)
  3. Upgrade OpenSSL (ansible, high risk, requires approval)
  4. Restart nginx (script, medium risk)
  5. Verify health (API call, low risk)

Example 2: PostgreSQL SELinux Denial

Input (examples/systemd_timeout_error.log):

ERROR systemd: postgresql.service: Start operation timed out
ERROR postgresql: could not open file: Permission denied
ERROR selinux: AVC denial: denied read access

Diagnosis:

  • Root cause: SELinux context mismatch on PostgreSQL data directory
  • Severity: Medium
  • Evidence: Permission denied + AVC denial

Generated Workflow:

  1. Check current SELinux context (script, low risk)
  2. Restore correct SELinux context (script, medium risk, requires approval)
  3. Restart PostgreSQL (script, medium risk)
  4. Verify database accessibility (API call, low risk)

Workflow Schema Compatibility

The generated workflows match the Nexus Workflow Engine schema:

{
  "schemaVersion": "1.0.0",
  "version": 1,
  "metadata": {
    "name": "auto-remediation-20251203-142345",
    "description": "Fix nginx segfault due to CVE-XXXX-YYYY",
    "tags": ["auto-remediation", "redhat", "CVE-XXXX-YYYY"]
  },
  "triggers": [{"type": "manual", "requiresApproval": true}],
  "workflow": {
    "activities": [...]
  }
}
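
As an illustration, the top-level document above could be assembled like this. The function name `build_workflow` is hypothetical, and the structure of individual activities is left to the caller because it is not spelled out here.

```python
import json
from datetime import datetime
from typing import Optional


def build_workflow(description: str, tags: list, activities: list,
                   now: Optional[datetime] = None) -> dict:
    """Assemble the top-level workflow document shown above."""
    now = now or datetime.now()
    return {
        "schemaVersion": "1.0.0",
        "version": 1,
        "metadata": {
            "name": now.strftime("auto-remediation-%Y%m%d-%H%M%S"),
            "description": description,
            "tags": ["auto-remediation", "redhat", *tags],
        },
        # Generated workflows are started manually and gated on approval.
        "triggers": [{"type": "manual", "requiresApproval": True}],
        "workflow": {"activities": activities},
    }


if __name__ == "__main__":
    doc = build_workflow("Fix nginx segfault", ["CVE-XXXX-YYYY"], [])
    print(json.dumps(doc, indent=2))
```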

Configuration

Retry Policies

Automatically configured based on risk level:

  • High risk: 1 attempt, fixed backoff
  • Medium risk: 2 attempts, exponential backoff
  • Low risk: 3 attempts, exponential backoff
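
The mapping above amounts to a small lookup table. A sketch, with illustrative class and field names that are not taken from workflow_generator.py:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    backoff: str  # "fixed" or "exponential"


# Values taken from the list above; field names are illustrative.
RETRY_POLICIES = {
    "high": RetryPolicy(max_attempts=1, backoff="fixed"),
    "medium": RetryPolicy(max_attempts=2, backoff="exponential"),
    "low": RetryPolicy(max_attempts=3, backoff="exponential"),
}


def retry_policy_for(risk: str) -> RetryPolicy:
    """Return the retry policy for a risk level.

    Unknown levels fall back to the most conservative (high-risk)
    policy; that fallback is an assumption of this sketch.
    """
    return RETRY_POLICIES.get(risk.strip().lower(), RETRY_POLICIES["high"])
```

High-risk steps get a single attempt, so a failed package upgrade or service stop is not repeated blindly.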

Approval Requirements

Activities requiring approval:

  • All high-risk operations
  • Service restarts
  • Package upgrades
  • Manual intervention steps

Approval timeout: 10 minutes (configurable)
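
Read literally, the rules above reduce to a single predicate. The sketch below assumes each activity carries a risk level and a category label. The `Activity` class and the category names are hypothetical, and the generator may apply the rules differently in practice.

```python
from dataclasses import dataclass

APPROVAL_TIMEOUT_SECONDS = 10 * 60  # documented default: 10 minutes

# Hypothetical category labels for the operations listed above.
APPROVAL_CATEGORIES = {"service_restart", "package_upgrade", "manual"}


@dataclass
class Activity:
    name: str
    risk: str      # "low", "medium", or "high"
    category: str  # e.g. "backup", "service_restart", "package_upgrade"


def requires_approval(activity: Activity) -> bool:
    """Apply the approval rules from the list above."""
    return activity.risk == "high" or activity.category in APPROVAL_CATEGORIES
```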

Troubleshooting

Issue: "Authentication required" for KB search

Solution: Set Red Hat credentials in .env:

REDHAT_USERNAME=your-username
REDHAT_PASSWORD=your-password

Issue: "CVE not found in Red Hat database"

Cause: The CVE may not affect Red Hat products, or Red Hat may not have analyzed it yet.

Solution: The agent falls back to KB article search.

Issue: MCP server connection fails

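Solution: Ensure the Python command and the server script path are correct in diagnostic_agent.py:

from mcp import StdioServerParameters

StdioServerParameters(
    command="python",  # or "python3", whichever is on your PATH
    args=["path/to/server.py"]
)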

Issue: Workflow validation fails

Cause: Generated workflow doesn't match schema.

Solution: Check workflow-definition.schema.json path and ensure it's the correct version.

API Rate Limits

Red Hat Security Data API

  • Public: No authentication required
  • Rate limit: Reasonable use (no official limit documented)

Red Hat Customer Portal API

  • Authentication: Required for some KB endpoints
  • Rate limit: Not publicly documented

Development

Running Tests (Coming Soon)

pytest tests/

Adding New MCP Tools

  1. Edit mcp_servers/redhat_security_server.py or redhat_kb_server.py
  2. Add new tool to @app.list_tools()
  3. Implement handler in @app.call_tool()
  4. Update agent prompt in diagnostic_agent.py

Limitations

  1. Execution: Workflow execution not implemented (generates definitions only)
  2. Ansible playbooks: Discovery works, but actual playbooks not included
  3. System access: Cannot directly query the failing system
  4. Context limits: Very large log files may need pre-processing
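
For the context-limit case, one possible pre-processing pass keeps only high-severity lines, collapses near-duplicates, and caps the result. This is a sketch that is not part of the pipeline; the function name, the severity keywords, and the default cap are all assumptions.

```python
import re

# Matches a bracketed PID such as the one in "nginx[1234]", so lines that
# differ only by PID collapse into one entry.
_PID = re.compile(r"\[\d+\]")


def condense_logs(text: str, max_lines: int = 200,
                  levels: tuple = ("ERROR", "CRITICAL", "FATAL")) -> str:
    """Keep only high-severity lines, drop duplicates, and cap the length."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or not any(level in line for level in levels):
            continue
        key = _PID.sub("[N]", line)
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    # Keep the most recent lines when the log is still too long.
    return "\n".join(kept[-max_lines:]) if max_lines > 0 else ""
```

The condensed text can then be passed to `--logs` directly or written to a file first.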

Future Enhancements

  • Integration with Ansible Galaxy for playbook discovery
  • Real-time log streaming support
  • Multi-server diagnostics (cluster-wide issues)
  • Workflow execution engine integration
  • Automated rollback on failure
  • Metrics and observability

License

MIT License (or your preferred license)

Support

For issues or questions:

  • Check the troubleshooting section above
  • Review example error logs in examples/
  • Consult Red Hat API documentation

Built with: OpenRouter, MCP (Model Context Protocol), Red Hat APIs
